HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

DNA barcodes enable Oxford Nanopore sequencing to sequence multiple barcoded DNA samples on a single flow cell. DNA sequences with the same barcode need to be grouped together through demultiplexing. As the number of samples increases, accurate demultiplexing becomes difficult. We introduce HycDemux, which incorporates a GPU-parallelized hybrid clustering algorithm that uses nanopore signals and DNA sequences for accurate data clustering, alongside a voting-based module to finalize the demultiplexing results. Comprehensive experiments demonstrate that our approach outperforms unsupervised tools in short sequence fragment clustering and performs more robustly than current state-of-the-art demultiplexing tools for complex multi-sample sequencing data. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03053-1.

We applied the following six evaluation indexes to evaluate the hybrid clustering algorithm and related tools: (1) adjusted mutual information (AMI), (2) Fowlkes-Mallows score (FMI), (3) accuracy (ACC), (4) homogeneity (HOMO), (5) completeness (COMP), and (6) time. Because the cluster distribution of each dataset is known, these indexes can be used to measure the effectiveness of clustering. The meaning of each index is described in the following paragraphs.
Running the clustering algorithm yields the actual label set, which we define as $V = \{V_1, V_2, \ldots, V_C\}$. That is, the actual clustering result contains $C$ clusters, where $V_j$ is the set of indices of the data points belonging to the $j$-th cluster, $j \in \{1, 2, \ldots, C\}$.
We define the matrix $M \in \mathbb{R}^{R \times C}$ as the contingency table of $U$ and $V$, where $U = \{U_1, U_2, \ldots, U_R\}$ is the real label set. The element $m_{ij}$ in the $i$-th row and $j$-th column of $M$ is $m_{ij} = |U_i \cap V_j|$. We define

$$p_{i,j} = \frac{m_{ij}}{N} \tag{4}$$

Equations (1)-(9) show the calculation process of AMI.

We further define $M$, $T$, $P$, and $Q$ as sets of point pairs $(x_i, x_j)$ (here $M$ denotes a set of pairs, not the contingency table):
• $M$: $\exists k, t$ such that $(x_i, x_j) \in V_k$ and $(x_i, x_j) \in U_t$.
• $T$: $\exists k$ such that $(x_i, x_j) \in V_k$, but $\forall t$, $(x_i, x_j) \notin U_t$.
• $P$: $\forall k$, $(x_i, x_j) \notin V_k$, but $\exists t$ such that $(x_i, x_j) \in U_t$.
• $Q$: $\forall k$, $(x_i, x_j) \notin V_k$, and $\forall t$, $(x_i, x_j) \notin U_t$.
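The contingency table and the four pair sets can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`contingency`, `pair_counts`, `fowlkes_mallows`) are hypothetical, labels are given as two equal-length lists, and the FMI expression is the standard pair-counting definition, which is assumed to match the equations referred to in this section.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def contingency(true_labels, pred_labels):
    """Return a mapping (i, j) -> m_ij = |U_i ∩ V_j|."""
    return Counter(zip(true_labels, pred_labels))


def pair_counts(true_labels, pred_labels):
    """Count the point pairs that fall into the sets M, T, P and Q."""
    counts = {"M": 0, "T": 0, "P": 0, "Q": 0}
    for a, b in combinations(range(len(true_labels)), 2):
        same_u = true_labels[a] == true_labels[b]  # pair shares a real cluster
        same_v = pred_labels[a] == pred_labels[b]  # pair shares a predicted cluster
        if same_v and same_u:
            counts["M"] += 1
        elif same_v:
            counts["T"] += 1
        elif same_u:
            counts["P"] += 1
        else:
            counts["Q"] += 1
    return counts


def fowlkes_mallows(true_labels, pred_labels):
    """Standard FMI: |M| / sqrt((|M| + |T|) * (|M| + |P|))."""
    c = pair_counts(true_labels, pred_labels)
    denom = sqrt((c["M"] + c["T"]) * (c["M"] + c["P"]))
    return c["M"] / denom if denom else 0.0
```

The pairwise loop is quadratic in the number of points, which is acceptable for small examples; the same counts can also be derived from the contingency table for larger inputs.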
AMI, FMI, and ACC reflect the similarity between the label set produced by clustering and the real label set, so they represent the overall clustering effect. HOMO reflects the local correctness of the clustering result. For example, if $U_1 = \{1, 2, 5, 6\}$ and $V_2 = \{1, 5, 6\}$, the homogeneity of $V_2$ is 100%, because every element of $V_2$ belongs to $U_1$. COMP reflects the completeness of the clusters obtained by clustering. For $V_2$, although its HOMO is 100%, its completeness is not, because $x_2$ belongs to $U_1$ but is not in $V_2$. Equations (12)-(16) show the calculation process of HOMO and COMP.
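The difference between HOMO and COMP can be checked numerically. The sketch below uses the standard entropy-based definitions (homogeneity $= 1 - H(U|V)/H(U)$, completeness $= 1 - H(V|U)/H(V)$), which are assumed here to correspond to the equations referred to above; the function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from the joint label counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity(true_labels, pred_labels):
    """1.0 when every predicted cluster contains members of one real cluster only."""
    h_true = entropy(true_labels)
    if h_true == 0:
        return 1.0
    return 1.0 - conditional_entropy(true_labels, pred_labels) / h_true


def completeness(true_labels, pred_labels):
    """1.0 when every real cluster ends up inside a single predicted cluster."""
    return homogeneity(pred_labels, true_labels)


# Points x1..x6: the real cluster U1 = {1, 2, 5, 6} is split into {1, 5, 6} and {2}.
true_labels = ["U1", "U1", "U2", "U2", "U1", "U1"]
pred_labels = ["V2", "V1", "V3", "V3", "V2", "V2"]
hom = homogeneity(true_labels, pred_labels)
com = completeness(true_labels, pred_labels)
```

In this example every predicted cluster is pure, so homogeneity is 1, while completeness is below 1 because the real cluster $U_1$ is split across two predicted clusters.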

S2 Comparison experiment table of signal-similarity based clustering method and base space-based clustering method
To show that the signal-similarity based clustering method is better than the base space-based clustering method, we generated a total of 9 data sets. The results for one of the data sets are shown in the paper. The following eight tables show the results for the remaining 8 data sets.

S4 Pseudo code about hybrid clustering algorithm
The following 5 pseudo-codes describe the clustering process of the hybrid clustering algorithm in detail.
Algorithm 1 Get the initial clustering results.
Input: A set of t nucleotide sequences N = {n_1, n_2, ..., n_t} sorted by decreasing length, and an identity threshold identity
Output: Clusters of sequences and their respective centers
while N is not empty do
    Find the first sequence in N as center_now
    NS ← all sequences in N that are filtered out by the short-word filter for center_now
    S ← N − NS
    cluster_now ← {center_now}
    if S is not empty then
        for sequence x in S do
            Align x with center_now to get similarity c
            if c ≥ identity then
                Add x to cluster_now
            end if
        end for
    end if
    Add center_now to Centers
    Add cluster_now to Clusters
    N ← N − cluster_now
end while
Algorithm 1 shows the main steps of the initial clustering algorithm, which follows the same idea as the CD-HIT algorithm. First, the t nucleotide sequences are stored in the set N and sorted by length, so that a longer sequence always comes before a shorter one. The longest sequence is then selected as the representative sequence, and a short-word filter is applied with respect to it. The sequences removed by the short-word filter are stored in the set NS, and S is the complement of NS in N. Each sequence in S is aligned with the representative sequence to obtain a similarity score. If the score is not less than the given identity, the sequence and the representative sequence belong to the same cluster. After the alignments, the cluster cluster_now of the representative sequence is obtained and removed from N, that is, N = N − cluster_now. The longest sequence remaining in N is repeatedly selected as the next representative sequence and its cluster is obtained in the same way, until N is empty and the algorithm ends.
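The greedy procedure described for Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the short-word filter is approximated by requiring at least `min_shared` common k-mers with the representative, and `difflib.SequenceMatcher.ratio()` stands in for the alignment-based similarity. The names `greedy_cluster`, `kmers`, `word`, and `min_shared` are hypothetical.

```python
from difflib import SequenceMatcher


def kmers(seq, k):
    """Set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(seqs, identity=0.9, word=4, min_shared=1):
    """CD-HIT-style greedy clustering; each cluster lists its representative first."""
    remaining = sorted(seqs, key=len, reverse=True)  # longest sequence first
    clusters = []
    while remaining:
        center = remaining.pop(0)  # current representative sequence
        center_words = kmers(center, word)
        cluster, rest = [center], []
        for x in remaining:
            # Assumed short-word filter: skip the costly comparison when
            # too few words are shared with the representative.
            passes = len(center_words & kmers(x, word)) >= min_shared
            # Stand-in for the alignment similarity score.
            if passes and SequenceMatcher(None, center, x).ratio() >= identity:
                cluster.append(x)
            else:
                rest.append(x)
        clusters.append(cluster)
        remaining = rest  # N = N - cluster_now
    return clusters


example = greedy_cluster(
    ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGC"], identity=0.8
)
```

Because every representative is removed from the remaining set together with its cluster, the loop always terminates, and each sequence ends up in exactly one cluster.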
Algorithm 1 produces a set of clusters, which need to be merged to obtain more accurate clustering results. Each nucleotide sequence has a corresponding nanopore signal, and clusters can be merged by comparing these signals with the dynamic time warping (DTW) algorithm. We use Algorithm 2 to calculate a DTW distance threshold. If the DTW distance between the signals of two sequences is less than this threshold, we consider the two sequences to belong to the same class.
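For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic-programming formulation on one-dimensional signals. It assumes an absolute-difference local cost and no warping window; the tool itself may use a different (for example GPU-parallelized or normalized) variant, and the names `dtw_distance` and `same_class` are hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two 1-D signals."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row for the empty prefix of a
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)  # assumed local cost
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def same_class(signal_a, signal_b, threshold):
    """Two signals are treated as one class when their DTW distance is below threshold."""
    return dtw_distance(signal_a, signal_b) < threshold
```

DTW tolerates local stretching of the time axis, so `[1, 2, 3]` and `[1, 2, 2, 3]` have distance 0, which is the property that makes it suitable for comparing nanopore signals of varying dwell time.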

Algorithm 2 Get the merge threshold
Input: A set of t nucleotide sequences N = {n_1, n_2, ..., n_t} sorted by decreasing length, the nanopore signal set S = {s_1, s_2, ..., s_t} corresponding to each nucleotide sequence, and an integer k
Output: A merge threshold

S5 Usage of our method
To use our tool, users are required to provide the following data:
• Raw nanopore signals: These are the original signals obtained from nanopore sequencing.
• DNA sequences after sequencing: The DNA sequences obtained through the sequencing process.
• Barcode sequences: The specific sequences corresponding to the barcodes used in the experiment.
• Adapter sequence: The adapter sequence used in the sequencing process.
Algorithm 3 Get some information about the initial clustering results

The workflow involves three scripts, run in order: "mainDataPreparation.py" for data preparation, "mainHybridClustering.py" for clustering, and "mainDemultiplexByClusteringRes.py" for demultiplexing.
• Clustering: The user runs the "mainHybridClustering.py" script for clustering. The inputs are the pseudo-barcode signals and pseudo-barcode sequences. The output is a file containing the clustering results.
• Demultiplexing: Finally, the user runs the "mainDemultiplexByClusteringRes.py" script to convert the clustering results into the final demultiplexing output. This step requires the clustering result file, the pseudo-barcode signals, and the standard nanopore signals as input. The output is a file containing the demultiplexing result.
By following this workflow, users can effectively utilize our toolkit for their data analysis needs.
Our tools and more detailed usage information can be found at the following links:

Algorithm 4 (continued, steps 12-13)
for s in OSS do
    G ← the set of all sequences in OSS whose corresponding signal has a DTW distance to the signal of s that is less than the given threshold

Algorithm 5 (continued, step 4)
if the DTW distance between ss and nn is less than the threshold then

Algorithm 2 (steps)
Run Algorithm 1 with a very high identity to get InitialClusters
Sort the clusters in InitialClusters by size (descending)
Pick the top one percent of the clusters and get their representative signals
Calculate all DTW distances between these representative signals, and record the maximum value as M and the minimum value as m
threshold = (m + M)/k

Algorithm 3 shows how the consensus signal set is obtained. First, we obtain the initial clustering result through Algorithm 1. According to the size of the obtained clusters, we divide all clusters into two categories, good clusters (GoodCluster) and bad clusters. The good clusters are merged according to the following rule: for each GoodCluster, 3 sequences are selected as representatives, their corresponding nanopore signals are found, and the DTW distances are calculated. If the DTW distance between the representative signals of two clusters is less than the threshold, the two clusters are merged. The merged good clusters are denoted GoodCluster′. For each cluster in GoodCluster′ we obtain its representative signals, and the set of these representative signals is called ConSigSet.

Algorithm 4 shows the overall process of the hybrid clustering algorithm. At the beginning, we run Algorithm 2 and Algorithm 3 to get the threshold, ConSigSet, and GoodCluster′. The sequences not in GoodCluster′ are placed in the set OSS. If the DTW distance between the signal of a sequence in OSS and a signal in ConSigSet is less than the threshold, the sequence is assigned to the cluster corresponding to that signal in ConSigSet. If OSS is still not empty, a sequence in OSS is selected as a representative, its nanopore signal is found, and the DTW distances between this signal and the signals of the other sequences in OSS are calculated. Sequences whose distance is less than the threshold are placed in the same class as the representative sequence. This is repeated until OSS is empty and the algorithm ends.

When running Algorithm 1, many ultra-short nanopore sequences caused by base-calling errors were removed to ensure the accuracy of the subsequent consensus sequences. The purpose of Algorithm 5 is to find the nanopore signals corresponding to these ultra-short sequences and to complete their clustering by signal retrieval: a signal is taken as a representative, and the DTW distances between it and the other signals are calculated. If the distance is less than the threshold, these signals belong to the same class as the representative signal.
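The threshold rule of Algorithm 2, threshold = (m + M)/k, and the assignment of a leftover signal to a consensus signal can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `dtw_distance` is a plain DTW with absolute-difference cost, the helper names are hypothetical, and choosing the closest consensus signal when several are below the threshold is an assumption, since the text only requires the distance to be below the threshold.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Plain DTW with absolute-difference local cost (assumed variant)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def merge_threshold(representative_signals, k):
    """threshold = (m + M) / k over all pairwise DTW distances.

    Needs at least two representative signals.
    """
    dists = [dtw_distance(a, b) for a, b in combinations(representative_signals, 2)]
    return (min(dists) + max(dists)) / k


def assign_to_consensus(signal, consensus_signals, threshold):
    """Index of the closest consensus signal below threshold, or None.

    Picking the closest match is an assumption of this sketch.
    """
    best_index, best_dist = None, threshold
    for i, consensus in enumerate(consensus_signals):
        d = dtw_distance(signal, consensus)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index


reps = [[0, 0, 0], [1, 1, 1], [5, 5, 5]]
threshold = merge_threshold(reps, k=2)
```

With these toy signals the pairwise distances are 3, 12, and 15, so the threshold is (3 + 15)/2 = 9. A signal for which `assign_to_consensus` returns `None` would stay in the leftover set and be handled by the later steps.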

• Data preparation: The user runs the "mainDataPreparation.py" script to extract the barcode data. In this step, the user inputs all the aforementioned data, and the output includes pseudo-barcode signals (the barcode signals extracted by our method), pseudo-barcode sequences (the barcode sequences extracted by our method), and the standard barcode signals corresponding to the barcode sequences.
Algorithm 4 (continued, steps 20-21)
Length ← the number of sequences included in Clusters_now
if Length is equal to t then

Algorithm 5 An algorithm to guarantee the integrity of clustering results
Input: The current Clusters_now, and a set of t nucleotide sequences N = {n_1, n_2, ..., n_t}
Output: The final clusters of nucleotide sequences
NN ← the set of sequences not from any Cluster in Clusters_now
for nn in NN do
    Randomly select the signal ss corresponding to a sequence from Cluster in Clusters_now
    ...
if Length is not equal to t then
    for sequence from NN do
        Cluster_new ← the new cluster that only contains sequence
        Clusters_now ← Clusters_now ∪ Cluster_new
    end for
end if

Table S1: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 50 categories of 145 bp nucleotide sequences as templates with 40 random seeds. In total, 2000 nanopore raw current signals are simulated, with 50 clusters in the dataset and 40 sequences in each cluster.

Table S2: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 100 categories of 145 bp nucleotide sequences as templates with 20 random seeds. In total, 2000 nanopore raw current signals are simulated, with 100 clusters in the dataset and 20 sequences in each cluster.

Table S3: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 20 categories of 145 bp nucleotide sequences as templates with 100 random seeds. In total, 2000 nanopore raw current signals are simulated, with 20 clusters in the dataset and 100 sequences in each cluster.

[Table columns: Data type; Tool, Method; Identity (%); AMI (%); FMI (%); ACC (%); HOMO (%); COMP (%); Time (min:sec)]

Table S4: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 100 categories of 95 bp nucleotide sequences as templates with 20 random seeds. In total, 2000 nanopore raw current signals are simulated, with 100 clusters in the dataset and 20 sequences in each cluster.

Table S5: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 50 categories of 95 bp nucleotide sequences as templates with 40 random seeds. In total, 2000 nanopore raw current signals are simulated, with 50 clusters in the dataset and 40 sequences in each cluster.

Table S6: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 20 categories of 95 bp nucleotide sequences as templates with 100 random seeds. In total, 2000 nanopore raw current signals are simulated, with 20 clusters in the dataset and 100 sequences in each cluster.

Table S7: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 100 categories of 45 bp nucleotide sequences as templates with 20 random seeds. In total, 2000 nanopore raw current signals are simulated, with 100 clusters in the dataset and 20 sequences in each cluster.

Table S8: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 50 categories of 45 bp nucleotide sequences as templates with 40 random seeds. In total, 2000 nanopore raw current signals are simulated, with 50 clusters in the dataset and 40 sequences in each cluster.

Table S9: Performance comparison of different clustering methods on a simulated dataset. The dataset is simulated by DeepSimulator, using 20 categories of 45 bp nucleotide sequences as templates with 100 random seeds. In total, 2000 nanopore raw current signals are simulated, with 20 clusters in the dataset and 100 sequences in each cluster.
Algorithm 3 (continued)
Input: A set of t nucleotide sequences N = {n_1, n_2, ..., n_t} sorted by decreasing length, and an identity threshold
Output: An initial classification signal set and the good clusters of the initial clustering results
return threshold, ConSigSet, all GoodCluster

Algorithm 4 An overview of our clustering method (steps 1-11)
Input: A set of t nucleotide sequences N = {n_1, n_2, ..., n_t} sorted by decreasing length, and the nanopore signal set S = {s_1, s_2, ..., s_t} corresponding to each nucleotide sequence
Output: Clusters of nucleotide sequences
Run Algorithm 2 and Algorithm 3 to get threshold, ConSigSet and GoodClusterSet′
Add all elements from GoodClusterSet′ to Clusters_now
OSS ← the set of nucleotide sequences not in GoodClusterSet′
while OSS is not empty do
    RS ← randomly select the signal corresponding to a sequence from GoodClusterSet′
    if the DTW distance between the signal corresponding to the nucleotide sequence S_now and a signal in ConSigSet is less than threshold and the DTW distance between the signal corresponding to the nucleotide sequence S_now and RS is less than threshold then
        Add S_now to GoodClusterSet′
        Remove S_now from OSS
    end if
end while
if OSS is not empty then